code context
Hidden in Plain Sight: Where Developers Confess Self-Admitted Technical Debt
Sridharan, Murali, Robredo, Mikel, Rantala, Leevi, Esposito, Matteo, Lenarduzzi, Valentina, Mantyla, Mika
Context. Detecting Self-Admitted Technical Debt (SATD) is crucial for proactive software maintenance. Previous research has primarily targeted detecting and prioritizing SATD, with little focus on the source code afflicted with SATD. Our goal in this work is to connect the SATD comments with source code constructs that surround them. Method. We leverage the extensive SATD dataset PENTACET, containing code comments from over 9000 Java Open Source Software (OSS) repositories. We quantitatively infer where SATD most commonly occurs and which code constructs/statements it most frequently affects. Results and Conclusions. Our large-scale study links over 225,000 SATD comments to their surrounding code, showing that SATD mainly arises in inline code near definitions, conditionals, and exception handling, where developers face uncertainty and trade-offs, revealing it as an intentional signal of awareness during change rather than mere neglect.
SpareCodeSearch: Searching for Code Context When You Have No Spare GPU
--Retrieval-Augmented Generation (RAG) frameworks aim to enhance Code Language Models (CLMs) by including another module for retrieving relevant context to construct the input prompt. However, these retrieval modules commonly use semantic search, requiring substantial computational resources for training and hosting these embedded models, making them infeasible to integrate into lightweight applications such as in-IDE AI-based code completion. In this solution paper, we prove that using keyword-search is sufficient to retrieve relevant and useful code context inside large codebases, without the need for extensive GPU resources. The usefulness of code contexts found by our solution is demonstrated through their completion results on the Code Context Competition's benchmark, reaching 0.748 and 0.725 chRF scores on Kotlin and Python tracks, respectively. Code Language Models (CLMs) have shown great promise in generating code, given the right contexts in their input prompts [1], [2].
REFINE: Enhancing Program Repair Agents through Context-Aware Patch Refinement
Pabba, Anvith, Chen, Simin, Mathai, Alex, Chakraborty, Anindya, Ray, Baishakhi
Large Language Models (LLMs) have recently shown strong potential in automatic program repair (APR), especially in repository-level settings where the goal is to generate patches based on natural language issue descriptions, large codebases, and regression tests. However, despite their promise, current LLM-based APR techniques often struggle to produce correct fixes due to limited understanding of code context and over-reliance on incomplete test suites. As a result, they frequently generate Draft Patches-partially correct patches that either incompletely address the bug or overfit to the test cases. In this work, we propose a novel patch refinement framework, Refine, that systematically transforms Draft Patches into correct ones. Refine addresses three key challenges: disambiguating vague issue and code context, diversifying patch candidates through test-time scaling, and aggregating partial fixes via an LLM-powered code review process. We implement Refine as a general refinement module that can be integrated into both open-agent-based and workflow-based APR systems. Our evaluation on the SWE-Bench Lite benchmark shows that Refine achieves state-of-the-art results among workflow-based approaches and approaches the best-known performance across all APR categories. Specifically, Refine boosts AutoCodeRover's performance by 14.67%, achieving a score of 51.67% and surpassing all prior baselines. On SWE-Bench Verified, Refine improves the resolution rate by 12.2%, and when integrated across multiple APR systems, it yields an average improvement of 14%-demonstrating its broad effectiveness and generalizability. These results highlight the effectiveness of refinement as a missing component in current APR pipelines and the potential of agentic collaboration in closing the gap between near-correct and correct patches. We also open source our code.
What Do They Fix? LLM-Aided Categorization of Security Patches for Critical Memory Bugs
Li, Xingyu, Pu, Juefei, Wu, Yifan, Zou, Xiaochen, Zhu, Shitong, Zou, Xiaochen, Zhu, Shitong, Wu, Qiushi, Zhang, Zheng, Hsu, Joshua, Dong, Yue, Qian, Zhiyun, Lu, Kangjie, Jaeger, Trent, De Lucia, Michael, Krishnamurthy, Srikanth V.
Open-source software projects are foundational to modern software ecosystems, with the Linux kernel standing out as a critical exemplar due to its ubiquity and complexity. Although security patches are continuously integrated into the Linux mainline kernel, downstream maintainers often delay their adoption, creating windows of vulnerability. A key reason for this lag is the difficulty in identifying security-critical patches, particularly those addressing exploitable vulnerabilities such as out-of-bounds (OOB) accesses and use-after-free (UAF) bugs. This challenge is exacerbated by intentionally silent bug fixes, incomplete or missing CVE assignments, delays in CVE issuance, and recent changes to the CVE assignment criteria for the Linux kernel. While fine-grained patch classification approaches exist, they exhibit limitations in both coverage and accuracy. In this work, we identify previously unexplored opportunities to significantly improve fine-grained patch classification. Specifically, by leveraging cues from commit titles/messages and diffs alongside appropriate code context, we develop DUALLM, a dual-method pipeline that integrates two approaches based on a Large Language Model (LLM) and a fine-tuned small language model. DUALLM achieves 87.4% accuracy and an F1-score of 0.875, significantly outperforming prior solutions. Notably, DUALLM successfully identified 111 of 5,140 recent Linux kernel patches as addressing OOB or UAF vulnerabilities, with 90 true positives confirmed by manual verification (many do not have clear indications in patch descriptions). Moreover, we constructed proof-of-concepts for two identified bugs (one UAF and one OOB), including one developed to conduct a previously unknown control-flow hijack as further evidence of the correctness of the classification.
Impact of Code Context and Prompting Strategies on Automated Unit Test Generation with Modern General-Purpose Large Language Models
Walczak, Jakub, Tomalak, Piotr, Laskowski, Artur
--Generative AI is gaining increasing attention in software engineering, where testing remains an indispensable reliability mechanism. According to the widely adopted testing pyramid, unit tests constitute the majority of test cases and are often schematic, requiring minimal domain expertise. Automatically generating such tests under the supervision of software engineers can significantly enhance productivity during the development phase of the software lifecycle. This paper investigates the impact of code context and prompting strategies on the quality and adequacy of unit tests generated by various large language models (LLMs) across several families. The results show that including docstrings notably improves code adequacy, while further extending context to the full implementation yields definitely smaller gains. Notably, the chain-of-thought prompting strategy -- applied even to'reasoning' models -- achieves the best results, with up to 96.3% branch coverage, a 57% average mutation score, and near-perfect compilation success rate. Among the evaluated models, M5 (Gemini 2.5 Pro) demonstrated superior performance in both mutation score and branch coverage being still in top in terms of compilation success rate. ECENT years have brought significant advancements in artificial intelligence (AI), particularly in the areas of performance and productivity enhancement. However, AI -- and particularly large language models (LLMs) -- still suffer from several weaknesses. Among them, convincing but senseless content generation ('hallucination'), safety misalignment ('ethicality') [1], unfairness [2], and limited processing context are the most critical. In spite of these restrictions, and bearing in mind the limited and merely apparent creativity of LLMs [3], they have become versatile tools already widely used across a variety of domains (creative industries [4], entertainment, reporting, and software engineering [5] are just cases in point) for multiple tasks.
Descriptor: C++ Self-Admitted Technical Debt Dataset (CppSATD)
Pham, Phuoc, Sridharan, Murali, Esposito, Matteo, Lenarduzzi, Valentina
In software development, technical debt (TD) refers to suboptimal implementation choices made by the developers to meet urgent deadlines and limited resources, posing challenges for future maintenance. Self-Admitted Technical Debt (SATD) is a sub-type of TD, representing specific TD instances ``openly admitted'' by the developers and often expressed through source code comments. Previous research on SATD has focused predominantly on the Java programming language, revealing a significant gap in cross-language SATD. Such a narrow focus limits the generalizability of existing findings as well as SATD detection techniques across multiple programming languages. Our work addresses such limitation by introducing CppSATD, a dedicated C++ SATD dataset, comprising over 531,000 annotated comments and their source code contexts. Our dataset can serve as a foundation for future studies that aim to develop SATD detection methods in C++, generalize the existing findings to other languages, or contribute novel insights to cross-language SATD research.
Technical Challenges in Maintaining Tax Prep Software with Large Language Models
Gogani-Khiabani, Sina, Dewangan, Varsha, Olson, Nina, Trivedi, Ashutosh, Tizpaz-Niari, Saeid
As the US tax law evolves to adapt to ever-changing politico-economic realities, tax preparation software plays a significant role in helping taxpayers navigate these complexities. The dynamic nature of tax regulations poses a significant challenge to accurately and timely maintaining tax software artifacts. The state-of-the-art in maintaining tax prep software is time-consuming and error-prone as it involves manual code analysis combined with an expert interpretation of tax law amendments. We posit that the rigor and formality of tax amendment language, as expressed in IRS publications, makes it amenable to automatic translation to executable specifications (code). Our research efforts focus on identifying, understanding, and tackling technical challenges in leveraging Large Language Models (LLMs), such as ChatGPT and Llama, to faithfully extract code differentials from IRS publications and automatically integrate them with the prior version of the code to automate tax prep software maintenance.
Detection, Classification and Prevalence of Self-Admitted Aging Debt
Sridharan, Murali, Mรคntylรค, Mika, Rantala, Leevi
Context: Previous research on software aging is limited with focus on dynamic runtime indicators like memory and performance, often neglecting evolutionary indicators like source code comments and narrowly examining legacy issues within the TD context. Objective: We introduce the concept of Aging Debt (AD), representing the increased maintenance efforts and costs needed to keep software updated. We study AD through Self-Admitted Aging Debt (SAAD) observed in source code comments left by software developers. Method: We employ a mixed-methods approach, combining qualitative and quantitative analyses to detect and measure AD in software. This includes framing SAAD patterns from the source code comments after analysing the source code context, then utilizing the SAAD patterns to detect SAAD comments. In the process, we develop a taxonomy for SAAD that reflects the temporal aging of software and its associated debt. Then we utilize the taxonomy to quantify the different types of AD prevalent in OSS repositories. Results: Our proposed taxonomy categorizes temporal software aging into Active and Dormant types. Our extensive analysis of over 9,000+ Open Source Software (OSS) repositories reveals that more than 21% repositories exhibit signs of SAAD as observed from our gold standard SAAD dataset. Notably, Dormant AD emerges as the predominant category, highlighting a critical but often overlooked aspect of software maintenance. Conclusion: As software volume grows annually, so do evolutionary aging and maintenance challenges; our proposed taxonomy can aid researchers in detailed software aging studies and help practitioners develop improved and proactive maintenance strategies.
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
Wei, Yuxiang, Duchenne, Olivier, Copet, Jade, Carbonneaux, Quentin, Zhang, Lingming, Fried, Daniel, Synnaeve, Gabriel, Singh, Rishabh, Wang, Sida I.
The recent DeepSeek-R1 release has demonstrated the immense potential of reinforcement learning (RL) in enhancing the general reasoning capabilities of large language models (LLMs). While DeepSeek-R1 and other follow-up work primarily focus on applying RL to competitive coding and math problems, this paper introduces SWE-RL, the first approach to scale RL-based LLM reasoning for real-world software engineering. Leveraging a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions), SWE-RL enables LLMs to autonomously recover a developer's reasoning processes and solutions by learning from extensive open-source software evolution data -- the record of a software's entire lifecycle, including its code snapshots, code changes, and events such as issues and pull requests. Trained on top of Llama 3, our resulting reasoning model, Llama3-SWE-RL-70B, achieves a 41.0% solve rate on SWE-bench Verified -- a human-verified collection of real-world GitHub issues. To our knowledge, this is the best performance reported for medium-sized (<100B) LLMs to date, even comparable to leading proprietary LLMs like GPT-4o. Surprisingly, despite performing RL solely on software evolution data, Llama3-SWE-RL has even emerged with generalized reasoning skills. For example, it shows improved results on five out-of-domain tasks, namely, function coding, library use, code reasoning, mathematics, and general language understanding, whereas a supervised-finetuning baseline even leads to performance degradation on average. Overall, SWE-RL opens up a new direction to improve the reasoning capabilities of LLMs through reinforcement learning on massive software engineering data.
Code Vulnerability Repair with Large Language Model using Context-Aware Prompt Tuning
Khan, Arshiya, Liu, Guannan, Gao, Xing
Large Language Models (LLMs) have shown significant challenges in detecting and repairing vulnerable code, particularly when dealing with vulnerabilities involving multiple aspects, such as variables, code flows, and code structures. In this study, we utilize GitHub Copilot as the LLM and focus on buffer overflow vulnerabilities. Our experiments reveal a notable gap in Copilot's abilities when dealing with buffer overflow vulnerabilities, with a 76% vulnerability detection rate but only a 15% vulnerability repair rate. To address this issue, we propose context-aware prompt tuning techniques designed to enhance LLM performance in repairing buffer overflow. By injecting a sequence of domain knowledge about the vulnerability, including various security and code contexts, we demonstrate that Copilot's successful repair rate increases to 63%, representing more than four times the improvement compared to repairs without domain knowledge.